Abstract. We present simple and efficient algorithms for calculating $q$-gram
frequencies on strings represented in compressed form, namely, as
a straight line program (SLP). Given an SLP of size $n$ that represents
string $T$, we present an $O(qn)$ time and space algorithm that computes
the occurrence frequencies of all $q$-grams in $T$. Computational experiments
show that our algorithm and its variation are practical for small
$q$, actually running faster on various real string data, compared to algorithms
that work on the uncompressed text. We also discuss applications
in data mining and classification of string data, for which our algorithms
can be useful.



1 Introduction 

A major problem in managing large scale string data is its sheer size. Therefore, 
such data is normally stored in compressed form. In order to utilize or analyze 
the data afterwards, the string is usually decompressed, where we must again 
confront the size of the data. To cope with this problem, algorithms that work 
directly on compressed representations of strings without explicit decompression 
have gained attention, especially for the string pattern matching problem [T] 
where algorithms on compressed text can actually run faster than algorithms on 
the uncompressed text [23]. There has been growing interest in what problems 
can be efficiently solved in this kind of setting [17,8].

Since there exist many different text compression schemes, it is not realistic 
to develop different algorithms for each scheme. Thus, it is common to consider 
algorithms on texts represented as straight line programs (SLPs) [12,17,8]. An
SLP is a context free grammar in the Chomsky normal form that derives a 
single string. Texts compressed by any grammar-based compression algorithm
(e.g. [21,15]) can be represented as SLPs, and those compressed by the LZ-family
(e.g. [24,25]) can be quickly transformed to SLPs [35]. Recently, even compressed
self-indices based on SLPs have appeared [6], and SLPs are a promising representation
of compressed strings for conducting various operations.

In this paper, we explore a more advanced field of application for compressed 
string processing: mining and classification on string data given in compressed 
form. Discovering useful patterns hidden in strings as well as automatic and 
accurate classification of strings into various groups, are important problems in 
the field of data mining and machine learning with many applications. As a first
step toward compressed string mining and classification, we consider the problem
of finding the occurrence frequencies for all $q$-grams contained in a given string.
$q$-grams are important features of string data, widely used for this purpose in
many fields such as text and natural language processing, and bioinformatics.

In [TU], an $O(|\Sigma|^2 n^2)$-time $O(n^2)$-space algorithm for finding the most frequent
2-gram from an SLP of size $n$ representing text $T$ over alphabet $\Sigma$ was
presented. In [5], it is mentioned that the most frequent 2-gram can be found
in $O(|\Sigma|^2 n \log n)$ time and $O(n \log |T|)$ space, if the SLP is pre-processed and a
self-index is built. It is possible to extend these two algorithms to handle $q$-grams
for $q > 2$, but they would respectively require $O(|\Sigma|^q q n^2)$ and $O(|\Sigma|^q q n \log n)$ time,
since they must essentially enumerate and count the occurrences of all substrings
of length $q$, regardless of whether the $q$-gram occurs in the string. Note also that
any algorithm that works on the uncompressed text $T$ requires exponential time
in the worst case, since $|T|$ can be as large as $O(2^n)$.

The main contribution of this paper is an $O(qn)$ time and space algorithm
that computes the occurrence frequencies for all $q$-grams in the text, given an
SLP of size n representing the text. Our new algorithm solves the more general 
problem and greatly improves the computational complexity compared to pre- 
vious work. We also conduct computational experiments on various real texts, 
showing that when q is small, our algorithm and its variation actually run faster 
than algorithms that work on the uncompressed text. 

Our algorithms have profound applications in the field of string mining and 
classification, and several applications and extensions are discussed. For example, 
our algorithm leads to an $O(q(n_1 + n_2))$ time algorithm for computing the $q$-gram
spectrum kernel [16] between SLP compressed texts of size $n_1$ and $n_2$. It also
leads to an $O(qn)$ time algorithm for finding the optimal $q$-gram (or emerging
$q$-gram) that discriminates between two sets of SLP compressed strings, when $n$
is the total size of the SLPs. 

Related Work. There exist many works on compressed text indices [20], but the
main focus there is on fast search for a given pattern. The compressed indices 
basically replace or simulate operations on uncompressed indices using a smaller 
data structure. Indices are important for efficient string processing, but note that 
simply replacing the underlying index used in a mining algorithm will generally 
increase time complexities of the algorithm due to the extra overhead required 
to access the compressed index. On the other hand, our approach is a new 
mining algorithm which exploits characteristics of the compressed representation 
to achieve faster running times. 

Several algorithms for finding characteristic sequences from compressed texts 
have been proposed, e.g., finding the longest common substring of two strings [19],
finding all palindromes [19], finding most frequent substrings [TU], and finding
the longest repeating substring [TU]. However, none of them have reported results
of computational experiments, implying that this paper is the first to show the 
practical usefulness of a compressed text mining algorithm. 



Algorithm 1: Calculating $vOcc(X_i)$ for all $1 \le i \le n$.

Input: SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$ representing string $T$.
Output: $vOcc(X_i)$ for all $1 \le i \le n$

$vOcc[X_n] \leftarrow 1$;
for $i \leftarrow 1$ to $n - 1$ do $vOcc[X_i] \leftarrow 0$;
for $i \leftarrow n$ to $2$ do
    if $X_i = X_\ell X_r$ then
        $vOcc[X_\ell] \leftarrow vOcc[X_\ell] + vOcc[X_i]$; $vOcc[X_r] \leftarrow vOcc[X_r] + vOcc[X_i]$;



2 Preliminaries 

Let $\Sigma$ be a finite alphabet. An element of $\Sigma^*$ is called a string. For any integer
$q > 0$, an element of $\Sigma^q$ is called a $q$-gram. The length of a string $T$ is denoted
by $|T|$. The empty string $\varepsilon$ is a string of length 0, namely, $|\varepsilon| = 0$. For a string
$T = XYZ$, $X$, $Y$ and $Z$ are called a prefix, substring, and suffix of $T$, respectively.
The $i$-th character of a string $T$ is denoted by $T[i]$ for $1 \le i \le |T|$, and the
substring of a string $T$ that begins at position $i$ and ends at position $j$ is denoted
by $T[i : j]$ for $1 \le i \le j \le |T|$. For convenience, let $T[i : j] = \varepsilon$ if $j < i$.

For a string $T$ and integer $q \ge 0$, let $pre(T, q)$ and $suf(T, q)$ represent, respectively,
the length-$q$ prefix and suffix of $T$. That is, $pre(T, q) = T[1 : \min(q, |T|)]$
and $suf(T, q) = T[\max(1, |T| - q + 1) : |T|]$.

For any strings $T$ and $P$, let $Occ(T, P)$ be the set of occurrences of $P$ in $T$,
i.e., $Occ(T, P) = \{k > 0 \mid T[k : k + |P| - 1] = P\}$. The number of elements
$|Occ(T, P)|$ is called the occurrence frequency of $P$ in $T$.
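The definitions above can be illustrated with a small Python sketch. The function names `pre`, `suf`, and `occ` simply mirror the notation of this section; the use of Python strings (0-based internally, with 1-based positions returned by `occ`) is an assumption of the sketch, not part of the paper.

```python
def pre(t, q):
    """Length-q prefix of t, or all of t when |t| < q."""
    return t[:q] if q > 0 else ""


def suf(t, q):
    """Length-q suffix of t, or all of t when |t| < q."""
    if q <= 0:
        return ""
    return t[-q:] if q < len(t) else t


def occ(t, p):
    """1-based start positions k such that t[k : k + |p| - 1] equals p."""
    return [k + 1 for k in range(len(t) - len(p) + 1) if t.startswith(p, k)]
```

For instance, `len(occ(t, p))` is the occurrence frequency of `p` in `t`.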



2.1 Straight Line Programs 

A straight line program (SLP) $\mathcal{T}$ is a sequence of assignments
$X_1 = expr_1, X_2 = expr_2, \ldots, X_n = expr_n$, where each $X_i$ is a variable and each
$expr_i$ is an expression, where $expr_i = a$ ($a \in \Sigma$), or $expr_i = X_\ell X_r$ ($\ell, r < i$).
Let $val(X_i)$ represent the string derived from $X_i$. When it is not confusing, we
identify a variable $X_i$ with $val(X_i)$. Then, $|X_i|$ denotes the length of the string
$X_i$ derives. An SLP $\mathcal{T}$ represents the string $T = val(X_n)$. The size of the
program $\mathcal{T}$ is the number $n$ of assignments in $\mathcal{T}$. (See Fig. 1.)

Fig. 1. The derivation tree of SLP $\mathcal{T} = \{X_i\}_{i=1}^{7}$ with $X_1 = a$, $X_2 = b$, $X_3 = X_1X_2$,
$X_4 = X_1X_3$, $X_5 = X_3X_4$, $X_6 = X_4X_5$, and $X_7 = X_6X_5$, representing string
$T = val(X_7) = aababaababaab$.

[Fig. 1 drawing: binary derivation tree rooted at $X_7$; its 13 leaves, at positions 1 to 13, spell aababaababaab.]

Algorithm 2: A naive algorithm for computing $q$-gram frequencies.

Input: string $T$, integer $q \ge 1$
Report: $(P, |Occ(T, P)|)$ for all $P \in \Sigma^q$ where $Occ(T, P) \ne \emptyset$.

$S \leftarrow \emptyset$; // empty associative array
for $i \leftarrow 1$ to $|T| - q + 1$ do
    $qgram \leftarrow T[i : i + q - 1]$;
    if $qgram \in keys(S)$ then $S[qgram] \leftarrow S[qgram] + 1$;
    else $S[qgram] \leftarrow 1$; // new $q$-gram
for $qgram \in keys(S)$ do Report $(qgram, S[qgram])$

The substring intervals of $T$ that each variable derives can be defined recursively
as follows: $itv(X_n) = \{[1 : |T|]\}$, and
$itv(X_i) = \{[u + |X_\ell| : v] \mid X_k = X_\ell X_i, [u : v] \in itv(X_k)\} \cup \{[u : u + |X_i| - 1] \mid X_k = X_i X_r, [u : v] \in itv(X_k)\}$
for $i < n$. For example, $itv(X_5) = \{[4 : 8], [9 : 13]\}$ in Fig. 1. Considering the
transitive reduction of set inclusion, the intervals $\bigcup_{i=1}^{n} itv(X_i)$ naturally form a binary
tree (the derivation tree). Let $vOcc(X_i) = |itv(X_i)|$ denote the number of times a
variable $X_i$ occurs in the derivation of $T$. $vOcc(X_i)$ for all $1 \le i \le n$ can be computed
in $O(n)$ time by a simple iteration on the variables, since $vOcc(X_n) = 1$
and for $i < n$, $vOcc(X_i) = \sum\{vOcc(X_k) \mid X_k = X_\ell X_i\} + \sum\{vOcc(X_k) \mid X_k = X_i X_r\}$.
(See Algorithm 1.)
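The iteration of Algorithm 1 can be sketched in Python as follows. The dictionary encoding of the SLP (characters for terminals, index pairs for concatenations, keys $1, \ldots, n$) and the name `vocc` are assumptions of this example; the rule set is the SLP of Fig. 1.

```python
# Hypothetical encoding of the Fig. 1 SLP: a character, or a pair (l, r) with l, r < i.
FIG1 = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def vocc(rules):
    """Number of nodes labelled X_i in the derivation tree, for every i."""
    n = max(rules)
    count = {i: 0 for i in rules}
    count[n] = 1                      # the root occurs exactly once
    for i in range(n, 1, -1):         # i = n down to 2, as in Algorithm 1
        e = rules[i]
        if not isinstance(e, str):
            left, right = e
            count[left] += count[i]   # each occurrence of X_i contributes one X_l
            count[right] += count[i]  # ... and one X_r
    return count
```

Because every variable is processed after all variables that refer to it, `count[i]` is final by the time $X_i$ is visited. Summing `count[i]` over the terminal rules for a given character gives that character's frequency in $T$.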



2.2 Suffix Arrays and LCP Arrays 

The suffix array $SA$ [18] of any string $T$ is an array of length $|T|$ such that
$SA[i] = j$, where $T[j : |T|]$ is the $i$-th lexicographically smallest suffix of $T$.
The lcp array of any string $T$ is an array of length $|T|$ such that $LCP[i]$ is the
length of the longest common prefix of $T[SA[i-1] : |T|]$ and $T[SA[i] : |T|]$ for
$2 \le i \le |T|$, and $LCP[1] = 0$. The suffix array for any string of length $|T|$ can be
constructed in $O(|T|)$ time (e.g. [11]) assuming an integer alphabet. Given the
text and suffix array, the lcp array can also be calculated in $O(|T|)$ time [13].
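To make the two arrays concrete, here is a deliberately simple Python sketch. It sorts suffixes directly and compares neighbours character by character, so it does not achieve the linear-time bounds cited above; it only illustrates the definitions. Positions are 0-based, and the function names are assumptions of the example.

```python
def suffix_array(t):
    """0-based start positions of the suffixes of t, in lexicographic order."""
    return sorted(range(len(t)), key=lambda j: t[j:])


def lcp_array(t, sa):
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)               # lcp[0] = 0 by convention
    for i in range(1, len(sa)):
        a, b, k = sa[i - 1], sa[i], 0
        while a + k < len(t) and b + k < len(t) and t[a + k] == t[b + k]:
            k += 1
        lcp[i] = k
    return lcp
```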



3 Algorithm 

3.1 Computing q-gram Frequencies on Uncompressed Strings 

We describe two algorithms (Algorithm 2 and Algorithm 3) for computing the
$q$-gram frequencies of a given uncompressed string $T$.

A naive algorithm for computing the $q$-gram frequencies is given in Algorithm 2.
The algorithm constructs an associative array, where keys consist of
$q$-grams, and the values correspond to the occurrence frequencies of the $q$-grams.
The time complexity depends on the implementation of the associative array, but
requires at least $O(q|T|)$ time since each $q$-gram is considered explicitly, and the
associative array is accessed $O(|T|)$ times: e.g. $O(q|T| \log |\Sigma|)$ time and $O(q|T|)$
space using a simple trie.
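A minimal Python rendering of the naive approach is shown below. It uses a hash-based `collections.Counter` as the associative array, which is an assumption of this sketch (the analysis above mentions a trie as one possible implementation).

```python
from collections import Counter


def qgram_freq_naive(t, q):
    """Frequencies of all q-grams of t, keyed by the q-gram itself."""
    # Each slice costs O(q), so the loop does Theta(q * |t|) character work.
    return Counter(t[i:i + q] for i in range(len(t) - q + 1))
```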

The $q$-gram frequencies of string $T$ can be calculated in $O(|T|)$ time using
suffix array $SA$ and lcp array $LCP$, as shown in Algorithm 3. For each $1 \le i \le |T|$,
the suffix $SA[i]$ represents an occurrence of $q$-gram $T[SA[i] : SA[i] + q - 1]$, if the
suffix is long enough, i.e. $SA[i] \le |T| - q + 1$. The key is that since the suffixes
are lexicographically sorted, intervals on the suffix array where the values in the
lcp array are at least $q$ represent occurrences of the same $q$-gram. The algorithm
runs in $O(|T|)$ time, since $SA$ and $LCP$ can be constructed in $O(|T|)$. The rest is
a simple $O(|T|)$ loop. A technicality is that we encode the output for a $q$-gram as
one of the positions in the text where the $q$-gram occurs, rather than the $q$-gram
itself. This is because there can be a total of $O(|T|)$ different $q$-grams, and if we
output them as length-$q$ strings, it would require at least $O(q|T|)$ time.

Algorithm 3: A linear time algorithm for computing $q$-gram frequencies.

Input: string $T$, integer $q \ge 1$
Report: $(i, |Occ(T, P)|)$ for all $P \in \Sigma^q$ and some position $i \in Occ(T, P)$.

$SA \leftarrow SUFFIXARRAY(T)$; $LCP \leftarrow LCPARRAY(T, SA)$; $count \leftarrow 1$;
for $i \leftarrow 2$ to $|T| + 1$ do
    if $i = |T| + 1$ or $LCP[i] < q$ then
        if $count > 0$ then Report $(SA[i - 1], count)$; $count \leftarrow 0$;
    if $i \le |T|$ and $SA[i] \le |T| - q + 1$ then $count \leftarrow count + 1$;
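The scan over the suffix array can be sketched in Python as follows. Two assumptions are worth flagging: the suffix array is built by plain sorting (so the sketch is not linear time), and the loop starts from the first suffix with `count = 0` so that a first suffix shorter than $q$ is never reported. Positions are 0-based, and the function name is hypothetical.

```python
def qgram_freq_sa(t, q):
    """Map one 0-based occurrence position of each distinct q-gram to its frequency."""
    n = len(t)
    sa = sorted(range(n), key=lambda j: t[j:])   # naive suffix array

    def lcp_below_q(a, b):
        """True if suffixes a and b share fewer than q leading characters."""
        k = 0
        while k < q and a + k < n and b + k < n and t[a + k] == t[b + k]:
            k += 1
        return k < q

    report, count = {}, 0
    for i in range(n + 1):
        # A run of equal q-grams ends at the array end or where the lcp drops below q.
        if i == n or i == 0 or lcp_below_q(sa[i - 1], sa[i]):
            if count > 0:
                report[sa[i - 1]] = count
            count = 0
        if i < n and sa[i] <= n - q:             # suffix is long enough
            count += 1
    return report
```

A caller that needs the $q$-grams themselves can recover each one as `t[p:p + q]` from the reported position `p`, at an extra $O(q)$ cost per distinct $q$-gram.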



3.2 Computing q-gram Frequencies on SLP 

We now describe the core idea of our algorithms, and explain two variations
which utilize variants of the two algorithms for uncompressed strings presented
in Section 3.1. For $q = 1$, the 1-gram frequencies are simply the frequencies of
the alphabet and the output is $(a, \sum\{vOcc(X_i) \mid X_i = a\})$ for each $a \in \Sigma$, which
takes only $O(n)$ time. For $q \ge 2$, we make use of Lemma 1 below. The idea is
similar to the mk Lemma [5], but the statement is more specific.

Lemma 1. Let $\mathcal{T} = \{X_i\}_{i=1}^{n}$ be an SLP that represents string $T$. For an interval
$[u : v]$ ($1 \le u < v \le |T|$), there exists exactly one variable $X_i = X_\ell X_r$ such that
for some $[u' : v'] \in itv(X_i)$, the following holds: $[u : v] \subseteq [u' : v']$,
$u \in [u' : u' + |X_\ell| - 1] \in itv(X_\ell)$ and $v \in [u' + |X_\ell| : v'] \in itv(X_r)$.

Proof. Consider length 1 intervals $[u : u]$ and $[v : v]$ corresponding to leaves
in the derivation tree. $X_i$ corresponds to the lowest common ancestor of these
intervals in the derivation tree. □

From Lemma 1, each occurrence of a $q$-gram ($q \ge 2$) represented by some
length-$q$ interval of $T$ corresponds to a single variable $X_i = X_\ell X_r$, and is
split in two by intervals corresponding to $X_\ell$ and $X_r$. On the other hand, consider
all length-$q$ intervals that correspond to a given variable. Counting the frequencies
of the $q$-grams they represent, and summing them up for all variables gives the
frequencies of all $q$-grams of $T$.

Fig. 2. Length-$q$ intervals corresponding to $X_i = X_\ell X_r$.




For variable $X_i = X_\ell X_r$, let $t_i = suf(X_\ell, q - 1)\,pre(X_r, q - 1)$. Then, all
$q$-grams represented by length-$q$ intervals that correspond to $X_i$ are those in $t_i$
(Fig. 2). If we obtain the frequencies of all $q$-grams in $t_i$, and then multiply each
frequency by $vOcc(X_i)$, we obtain frequencies for the $q$-grams occurring in all
intervals derived by $X_i$. It remains to sum up the $q$-gram frequencies of $t_i$ for all
$1 \le i \le n$. We can regard it as obtaining the weighted $q$-gram frequencies in the
set of strings $\{t_1, \ldots, t_n\}$, where each $q$-gram in $t_i$ is weighted by $vOcc(X_i)$.

We further reduce this problem to a weighted $q$-gram frequency problem for
a single string $z$ as in Algorithm 4. String $z$ is constructed by concatenating $t_i$
such that $q \le |t_i| \le 2(q - 1)$, and the weights of $q$-grams starting at each position
in $z$ are held in array $w$. On line 8, 0's instead of $vOcc(X_i)$ are appended to $w$
for the last $q - 1$ values corresponding to $t_i$. This is to avoid counting unwanted
$q$-grams that are generated by the concatenation of $t_i$ to $z$ on line 6, which are
not substrings of each $t_i$. The weighted $q$-gram frequency problem for a single
string (line 9) can be solved with a slight modification of Algorithm 2 or 3. The
modified algorithms are shown respectively in Algorithms 5 and 6.

Theorem 1. Given an SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$ of size $n$ representing a string $T$, the
$q$-gram frequencies of $T$ can be computed in $O(qn)$ time for any $q > 0$.

Proof. Consider Algorithm 4. The correctness is straightforward from the above
arguments, so we consider the time complexity. Line 1 can be computed in $O(n)$
time. Line 2 can be computed in $O(qn)$ time by a simple dynamic programming.
For $pre()$: If $X_i = a$ for some $a \in \Sigma$, then $pre(X_i, q - 1) = a$. If $X_i = X_\ell X_r$ and
$|X_\ell| \ge q - 1$, then $pre(X_i, q - 1) = pre(X_\ell, q - 1)$. If $X_i = X_\ell X_r$ and $|X_\ell| < q - 1$,
then $pre(X_i, q - 1) = pre(X_\ell, q - 1)\,pre(X_r, q - 1 - |X_\ell|)$. The strings $suf()$ can
be computed similarly. The computation amounts to copying $O(q)$ characters
for each variable, and thus can be done in $O(qn)$ time. For the loop at line 4,
since the length of string $t_i$ appended to $z$, as well as the number of elements
appended to $w$, is at most $2(q - 1)$ in each loop, the total time complexity is
$O(qn)$. Finally, since the length of $z$ and $w$ is $O(qn)$, line 9 can be calculated in
$O(qn)$ time using the weighted version of Algorithm 3 (Algorithm 6). □
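The steps analysed in the proof can be put together in a short Python sketch. It follows the structure of Algorithm 4 but, for brevity, counts the weighted $q$-grams of $z$ with a hash table (in the spirit of Algorithm 5) rather than with a suffix array, so it does not attain the $O(qn)$ bound of the theorem. The dictionary encoding of the SLP and all identifiers are assumptions of the example.

```python
from collections import Counter

# Hypothetical encoding of the Fig. 1 SLP: a character, or a pair (l, r) with l, r < i.
FIG1 = {1: "a", 2: "b", 3: (1, 2), 4: (1, 3), 5: (3, 4), 6: (4, 5), 7: (6, 5)}


def slp_qgram_freq(rules, q):
    """q-gram frequencies (q >= 2) of the string derived by the last variable."""
    if q < 2:
        raise ValueError("this sketch handles q >= 2 only")
    n = max(rules)
    size, pre, suf = {}, {}, {}
    for i in range(1, n + 1):                   # bottom-up dynamic programming
        e = rules[i]
        if isinstance(e, str):
            size[i], pre[i], suf[i] = 1, e, e
        else:
            l, r = e
            size[i] = size[l] + size[r]
            pre[i] = (pre[l] + pre[r])[:q - 1]      # copies O(q) characters
            suf[i] = (suf[l] + suf[r])[-(q - 1):]
    vocc = {i: 0 for i in rules}
    vocc[n] = 1
    for i in range(n, 1, -1):                   # top-down, as in Algorithm 1
        e = rules[i]
        if not isinstance(e, str):
            vocc[e[0]] += vocc[i]
            vocc[e[1]] += vocc[i]
    pieces, w = [], []
    for i in range(1, n + 1):
        e = rules[i]
        if not isinstance(e, str) and size[i] >= q:
            t_i = suf[e[0]] + pre[e[1]]
            pieces.append(t_i)
            # Real q-grams of t_i get weight vOcc(X_i); the last q-1 positions get 0
            # so that q-grams spanning two concatenated pieces are ignored.
            w += [vocc[i]] * (len(t_i) - q + 1) + [0] * (q - 1)
    z = "".join(pieces)
    freq = Counter()
    for k in range(len(z) - q + 1):
        if w[k] > 0:
            freq[z[k:k + q]] += w[k]
    return freq
```

On the SLP of Fig. 1 with $q = 3$, the sketch builds $z$ from $t_4 = aab$, $t_5 = abaa$, $t_6 = abab$, and $t_7 = abab$, a string of length 15, and never expands the full text.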

Note that the time complexity for using the weighted version of Algorithm 2
(Algorithm 5) for line 9 of Algorithm 4 would be at least $O(q^2 n)$: e.g. $O(q^2 n \log |\Sigma|)$ time and
$O(q^2 n)$ space using a trie.

4 Applications and Extensions 

We showed that for an SLP $\mathcal{T}$ of size $n$ representing string $T$, $q$-gram frequency
problems on $T$ can be reduced to weighted $q$-gram frequency problems on a string
$z$ of length $O(qn)$, which can be much shorter than $T$. This idea can further be
applied to obtain efficient compressed string processing algorithms for interesting 
problems which we briefly introduce below. 



Algorithm 4: Calculating $q$-gram frequencies of an SLP for $q \ge 2$

Input: SLP $\mathcal{T} = \{X_i\}_{i=1}^{n}$ representing string $T$, integer $q \ge 2$.
Report: all $q$-grams and their frequencies which occur in $T$.

1 Calculate $vOcc(X_i)$ for all $1 \le i \le n$;
2 Calculate $pre(X_i, q - 1)$ and $suf(X_i, q - 1)$ for all $1 \le i \le n - 1$;
3 $z \leftarrow \varepsilon$; $w \leftarrow []$;
4 for $i \leftarrow 1$ to $n$ do
5     if $X_i = X_\ell X_r$ and $|X_i| \ge q$ then
6         $t_i = suf(X_\ell, q - 1)\,pre(X_r, q - 1)$; $z$.append($t_i$);
7         for $j \leftarrow 1$ to $|t_i| - q + 1$ do $w$.append($vOcc(X_i)$);
8         for $j \leftarrow 1$ to $q - 1$ do $w$.append(0);
9 Report $q$-gram frequencies in $z$, where each $q$-gram $z[i : i + q - 1]$ is weighted
  by $w[i]$.



Algorithm 5: A variant of Algorithm 2 for weighted $q$-gram frequencies.

Input: string $T$, array of integers $w$ of length $|T|$, integer $q \ge 1$
Report: $(P, \sum_{i \in Occ(T, P)} w[i])$ for all $P \in \Sigma^q$ where $\sum_{i \in Occ(T, P)} w[i] > 0$.

$S \leftarrow \emptyset$; // empty associative array
for $i \leftarrow 1$ to $|T| - q + 1$ do
    $qgram \leftarrow T[i : i + q - 1]$;
    if $qgram \in keys(S)$ then $S[qgram] \leftarrow S[qgram] + w[i]$;
    else if $w[i] > 0$ then $S[qgram] \leftarrow w[i]$; // new $q$-gram
for $qgram \in keys(S)$ do Report $(qgram, S[qgram])$



4.1 q-gram Spectrum Kernel 

A string kernel is a function that computes the inner product between two strings
which are mapped to some feature space. It is used when classifying string or text
data using methods such as Support Vector Machines (SVMs), and is usually
the dominating factor in the time complexity of SVM learning and classification.
A $q$-gram spectrum kernel [16] considers the feature space of $q$-grams. For string
$T$, let $\phi_q(T) = (|Occ(T, p)|)_{p \in \Sigma^q}$. The kernel function is defined as
$K_q(T_1, T_2) = \langle \phi_q(T_1), \phi_q(T_2) \rangle = \sum_{p \in \Sigma^q} |Occ(T_1, p)|\,|Occ(T_2, p)|$.
The calculation of the kernel function amounts to summing up the product of
occurrence frequencies in strings $T_1$ and $T_2$ for all $q$-grams which occur in both
$T_1$ and $T_2$. This can be done in $O(|T_1| + |T_2|)$ time using suffix arrays. For two
SLPs $\mathcal{T}_1$ and $\mathcal{T}_2$ of size $n_1$ and $n_2$ representing strings $T_1$ and $T_2$, respectively,
the $q$-gram spectrum kernel $K_q(T_1, T_2)$ can be computed in $O(q(n_1 + n_2))$ time by
a slight modification of our algorithm.
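The kernel definition itself is easy to state in code once the two frequency tables are available. The Python sketch below computes the inner product from two dictionaries; the helper `qgram_freq` uses naive counting on uncompressed strings purely as a stand-in, and both function names are assumptions of the example. It does not implement the suffix-array or SLP-based computation mentioned above.

```python
from collections import Counter


def qgram_freq(t, q):
    """Stand-in frequency table: naive q-gram counting on an uncompressed string."""
    return Counter(t[i:i + q] for i in range(len(t) - q + 1))


def spectrum_kernel(freq1, freq2):
    """Sum of |Occ(T1, p)| * |Occ(T2, p)| over q-grams p present in both tables."""
    if len(freq1) > len(freq2):
        freq1, freq2 = freq2, freq1   # iterate over the smaller table
    return sum(c * freq2[p] for p, c in freq1.items() if p in freq2)
```

Only $q$-grams that occur in both strings contribute, so iterating over the smaller table suffices.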



4.2 Optimal Substring Patterns of Length q 



Given two sets of strings, finding string patterns that are frequent in one set
and not in the other is an important problem in string data mining, with many
problem formulations and types of patterns to be considered, e.g., in Bioinformatics
[3], Machine Learning (optimal patterns [2]), and more recently KDD
(emerging patterns [J). A simple optimal $q$-gram pattern discovery problem can
be defined as follows: Let $T_1$ and $T_2$ be two multisets of strings. The problem is
to find the $q$-gram $p$ which gives the highest (or lowest) score according to some
scoring function that depends only on $|T_1|$, $|T_2|$, and the number of strings
respectively in $T_1$ and $T_2$ for which $p$ is a substring. For uncompressed strings, the
problem can be solved in $O(N)$ time, where $N$ is the total length of the strings
in both $T_1$ and $T_2$, by applying the algorithm of [3] to two sets of strings. For
the SLP compressed version of this problem, the input is two multisets of SLPs,
each representing strings in $T_1$ and $T_2$. If $n$ is the total number of variables
used in all of the SLPs, the problem can be solved in $O(qn)$ time.

Algorithm 6: A variant of Algorithm 3 for weighted $q$-gram frequencies.

Input: string $T$, array of integers $w$ of length $|T|$, integer $q \ge 1$
Output: $(i, \sum_{i \in Occ(T, P)} w[i])$ for all $P \in \Sigma^q$ where $\sum_{i \in Occ(T, P)} w[i] > 0$ and
some position $i \in Occ(T, P)$.

$SA \leftarrow SUFFIXARRAY(T)$; $LCP \leftarrow LCPARRAY(T, SA)$; $count \leftarrow 1$;
for $i \leftarrow 2$ to $|T| + 1$ do
    if $i = |T| + 1$ or $LCP[i] < q$ then
        if $count > 0$ then Report $(SA[i - 1], count)$; $count \leftarrow 0$;
    if $i \le |T|$ and $SA[i] \le |T| - q + 1$ then $count \leftarrow count + w[SA[i]]$;

4.3 Different Lengths 

The ideas in this paper can be used to consider not only substrings of length $q$,
but substrings of all lengths up to $q$, with some modifications. For the applications
discussed above, although the number of such substrings increases to $O(q^2 n)$,
the $O(qn)$ time complexity can be maintained by using standard techniques of
suffix arrays [7,13]. This is because there exist only $O(qn)$ substrings with distinct
frequencies (corresponding to nodes of the suffix tree), and the computations of
the extra substrings can be summarized with respect to them.

5 Computational Experiments 

We implemented 4 algorithms (NMP, NSA, SMP, SSA) that count the frequencies
of all $q$-grams in a given text. NMP (Algorithm 2) and NSA (Algorithm 3)
work on the uncompressed text. SMP (Algorithm 4 + Algorithm 5) and SSA
(Algorithm 4 + Algorithm 6) work on SLPs. The algorithms were implemented
using the C++ language. We used std::map from the Standard Template Library
(STL) for the associative array implementation.^1 For constructing suffix
arrays, we used the divsufsort library developed by Yuta Mori. This implementation
is not linear time in the worst case, but has been empirically shown to be
one of the fastest implementations on various data.

^1 We also used std::hash_map but omit the results due to lack of space. Choosing the
hashing function to use is difficult, and we note that its performance was unstable
and sometimes very bad when varying $q$.

All computations were conducted on a Mac Xserve (Early 2009) with 2 x
2.93GHz Quad Core Xeon processors and 24GB Memory, only utilizing a single
process/thread at once. The program was compiled using the GNU C++
compiler (g++) 4.2.1 with the -fast option for optimization. The running times
are measured in seconds, starting from after reading the uncompressed text into 
memory for NMP and NSA, and after reading the SLP that represents the text 
into memory for SMP and SSA. Each computation is repeated at least 3 times, 
and the average is taken. 



5.1 Fibonacci Strings 

The $i$-th Fibonacci string $F_i$ can be represented by the following SLP:
$X_1 = b$, $X_2 = a$, $X_i = X_{i-1} X_{i-2}$ for $i > 2$, and $F_i = val(X_i)$. Fig. 3
shows the running times on Fibonacci strings $F_{25}, \ldots, F_{95}$, for $q = 50$.
Although this is an extreme case since Fibonacci strings can be exponentially
compressed, we can see that SMP and SSA that work on the SLP are clearly
faster than NMP and NSA which work on the uncompressed string.
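The gap between SLP size and text length for Fibonacci strings can be seen from a few lines of Python. The dictionary encoding of the rules and the names `fibonacci_slp` and `derived_length` are assumptions of this sketch; it is not the benchmark code used in the experiments.

```python
def fibonacci_slp(k):
    """Rules X_1 = b, X_2 = a, X_i = X_{i-1} X_{i-2} for 2 < i <= k (assumes k >= 2)."""
    rules = {1: "b", 2: "a"}
    for i in range(3, k + 1):
        rules[i] = (i - 1, i - 2)
    return rules


def derived_length(rules):
    """Length of the string derived by the last variable, without expanding it."""
    size = {}
    for i in sorted(rules):
        e = rules[i]
        size[i] = 1 if isinstance(e, str) else size[e[0]] + size[e[1]]
    return size[max(rules)]
```

An SLP with $k$ rules derives a string whose length is the $k$-th Fibonacci number, so the SLP for $F_{95}$ has only 95 assignments while the text it represents is far too long to store explicitly.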

5.2 Pizza & Chili Corpus 

We also applied the algorithms on texts XML, DNA, ENGLISH, and PROTEINS,
with sizes 50MB, 100MB, and 200MB, obtained from the Pizza & Chili Corpus.
We used RE-PAIR [15] to obtain SLPs for this data.

Table 1 shows the running times for all algorithms and data, where $q$
is varied from 2 to 10. We see that for all corpora, SMP and SSA running on
SLPs are actually faster than NMP and NSA running on uncompressed
text, when $q$ is small. Furthermore, SMP is faster than SSA when $q$ is smaller.
Interestingly for XML, the SLP versions are faster even for $q$ up to 9.

Fig. 4 shows the same results as time ratios: NMP/SMP and NSA/SSA, plotted
against the ratio (length of $z$ in Algorithm 4)/(length of uncompressed text). As
expected, the SLP versions are basically faster than their uncompressed counterparts
when $|z|$/(text length) is less than 1, since the SLP versions run the
weighted versions of the uncompressed algorithms on a text of length $|z|$. SLPs
generated by other grammar based compression algorithms showed similar tendencies
(data not shown).

Fig. 4. Time ratios NMP/SMP and NSA/SSA plotted against ratio $|z|/|T|$.

6 Conclusion 

We presented an $O(qn)$ time and space algorithm for calculating all $q$-gram
frequencies in a string, given an SLP of size $n$ representing the string. This solves,
much more efficiently, a more general problem than considered in previous work.
Computational experiments on various real texts showed that the algorithms run
faster than algorithms that work on the uncompressed string, when $q$ is small.
Although larger values of $q$ allow us to capture longer character dependencies,
the dimensionality of the features increases, making the space of occurring
$q$-grams sparse. Therefore, meaningful values of $q$ for typical applications can be
fairly small in practice (e.g. 3 to 6), so our algorithms have practical value.

Future work includes extending our algorithms that work on SLPs to algorithms
that work on collage systems [14]. A collage system is a more general framework
for modeling various compression methods. In addition to the simple concatenation
operation used in SLPs, it includes operations for repetition and prefix/suffix
truncation of variables.

This is the first paper to show the potential of the compressed string process- 
ing approach in developing efficient and practical algorithms for problems in the 
field of string mining and classification. More and more efficient algorithms for 
various processing of text in compressed representations are becoming available. 
We believe texts will eventually be stored in compressed form by default, since 
not only will it save space, but it will also have the added benefit of being able 
to conduct various computations on it more efficiently later on, when needed. 
Table 1. Running times in seconds for data from the Pizza & Chili Corpus. Bold
numbers represent the fastest time for each data and $q$. Times for SMP and SSA
are prefixed with >, if they become fastest when all algorithms start from the SLP
representation, i.e., NMP and NSA require time for decompressing the SLP (denoted
by decompression time). The bold horizontal lines show the boundary where $|z|$ in
Algorithm 4 exceeds the uncompressed text length.